Parsing with Treebank Grammars: Empirical Bounds, Theoretical Models, and the Structure of the Penn Treebank

Authors

  • Dan Klein
  • Christopher D. Manning
Abstract

This paper presents empirical studies and closely corresponding theoretical models of the performance of a chart parser exhaustively parsing the Penn Treebank with the Treebank’s own CFG grammar. We show how performance is dramatically affected by rule representation and tree transformations, but little by top-down vs. bottom-up strategies. We discuss grammatical saturation, including analysis of the strongly connected components of the phrasal nonterminals in the Treebank, and model how, as sentence length increases, the effective grammar rule size increases as regions of the grammar are unlocked, yielding super-cubic observed time behavior in some configurations.

Similar resources

A Bayesian Model for Supervised Natural Language Grammar Induction

In this paper, we show that the problem of grammar induction could be modeled as a combination of several model selection problems. We use the infinite generalization of a Bayesian model of cognition to solve each model selection problem in our grammar induction model. This Bayesian model is capable of solving model selection problems, consistent with human cognition. We also show that using th...

Unlexicalised Hidden Variable Models of Split Dependency Grammars

This paper investigates transforms of split dependency grammars into unlexicalised context-free grammars annotated with hidden symbols. Our best unlexicalised grammar achieves an accuracy of 88% on the Penn Treebank data set, which represents a 50% reduction in error over previously published results on unlexicalised dependency parsing.

Robust PCFG-Based Generation Using Automatically Acquired LFG Approximations

Wide-coverage grammars automatically extracted from treebanks are a cornerstone technology in state-of-the-art probabilistic parsing. They achieve robustness and coverage at a fraction of the development cost of hand-crafted grammars. It is surprising to note that to date, such grammars do not usually figure in the complementary operation to parsing – natural language surface realisation. Banga...

Treebank vs. Xbar-based Automatic F-structure Annotation

Manual, large-scale (computational) grammar development is time-consuming, expensive and requires lots of linguistic expertise. More recently, a number of alternatives based on treebank resources (such as Penn-II, Susanne, AP treebank) have been explored. The idea is to automatically "induce" or rather read off (P)CFG grammars from the parse-annotated treebank resources and to use the treebank g...

Probabilistic Disambiguation Models for Wide-Coverage HPSG Parsing

This paper reports the development of log-linear models for disambiguation in wide-coverage HPSG parsing. The estimation of log-linear models requires high computational cost, especially with wide-coverage grammars. Using techniques to reduce the estimation cost, we trained the models using 20 sections of Penn Treebank. A series of experiments empirically evaluated the estimation techniques, ...

Publication date: 2001